Implement canonical domain update script #348
LGTM. We can just standardise logging to use %s, as per this comment.
Some thoughts/questions:
Hey @philbudne interesting thoughts and questions:
With this validation, execution would just remain the same without needing a lot of modification. Let me work on this and push the changes.
I'm open to making changes and using search_after.
…enhance efficiency and avoid performance bottlenecks
👍🏽
This PR introduces a script for updating incorrect canonical domains for documents stored in Elasticsearch.
The implementation of this script uses:
The scroll API for the retrieval of documents in batches with configurable batch size.
The Bulk Helpers to perform batch updates across multiple documents simultaneously, significantly reducing the number of requests sent to Elasticsearch.
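For illustration, here is a minimal sketch of how update actions for the Bulk Helpers could be generated from a batch of hits. It is not the code in this PR: the function name `build_update_actions` and the field name `canonical_domain` are assumptions made for the example, and the resulting generator is meant to be passed to `elasticsearch.helpers.bulk(client, actions)`.

```python
from typing import Any, Dict, Iterable, Iterator


def build_update_actions(
    hits: Iterable[Dict[str, Any]],
    new_domain: str,
    field: str = "canonical_domain",  # hypothetical field name, adjust to the real mapping
) -> Iterator[Dict[str, Any]]:
    """Yield one partial-update action per hit, in the dict format the bulk helpers accept."""
    for hit in hits:
        # Skip documents that already carry the desired value.
        if hit.get("_source", {}).get(field) == new_domain:
            continue
        yield {
            "_op_type": "update",
            "_index": hit["_index"],
            "_id": hit["_id"],
            "doc": {field: new_domain},
        }


# Intended use (requires a live cluster, so shown only as a comment):
#   from elasticsearch import Elasticsearch, helpers
#   helpers.bulk(Elasticsearch(...), build_update_actions(hits, "example.com"))
```

Yielding actions lazily lets the helper chunk them into bulk requests without holding the whole batch in memory.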
Example usage of the script:
Addresses #345
====
Update
The following improvements have been made following observations from @philbudne:
To address concerns regarding the use of the scroll API, the implementation has been updated to use search_after to retrieve the documents that should be updated.
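As a rough sketch of the search_after pattern (not the exact code in this PR), pagination can be written as a generator that resubmits the same query with the sort values of the last hit from the previous page. The `search` callable, the function name `iter_hits`, and the parameter names are assumptions for the example; in practice `search` would wrap the client's search call, and `sort` should include a unique tiebreaker so no documents are skipped or repeated.

```python
from typing import Any, Callable, Dict, Iterator, List


def iter_hits(
    search: Callable[[Dict[str, Any]], Dict[str, Any]],  # hypothetical wrapper around the client
    query: Dict[str, Any],
    sort: List[Dict[str, Any]],
    page_size: int = 1000,
) -> Iterator[Dict[str, Any]]:
    """Yield every matching hit, fetching one page at a time with search_after."""
    body: Dict[str, Any] = {"query": query, "sort": sort, "size": page_size}
    while True:
        hits = search(body)["hits"]["hits"]
        if not hits:
            return  # an empty page means the result set is exhausted
        yield from hits
        # Resume after the sort values of the last hit on this page.
        body = {**body, "search_after": hits[-1]["sort"]}
```

Unlike scroll, this keeps no search context open on the cluster between requests; each page is an ordinary search.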
Support for query_string has been added, so the script can also be called as follows: